Abstract

We describe a new corpus of over 180,000 hand-annotated
dialog act tags and accompanying adjacency pair annotations
for roughly 72 hours of speech from 75 naturally occurring
meetings. We provide a brief summary of the annotation
system and labeling procedure, inter-annotator reliability
statistics, overall distributional statistics, a description
of auxiliary files distributed with the corpus, and
information on how to obtain the data.

1  Introduction 

Natural meetings offer rich opportunities for studying a
variety of complex discourse phenomena. Meetings contain
regions of high speaker overlap, affective variation,
complicated interaction structures, abandoned or interrupted
utterances, and other interesting turn-taking and
discourse-level phenomena. In addition, meetings that occur
naturally involve real topics, debates, issues, and social
dynamics that should generalize more readily to other real
meetings than data collected using artificial scenarios.
Thus meetings pose interesting challenges to descriptive and
theoretical models of discourse, as well as to researchers
in the speech recognition community [4,7,9,13,14,15].

We describe a new corpus of hand-annotated dialog acts and
adjacency pairs for roughly 72 hours of naturally occurring
multi-party meetings. The meetings were recorded at the
International Computer Science Institute (ICSI) as part of
the ICSI Meeting Recorder Project [9]. Word transcripts and
audio files from that corpus are available through the
Linguistic Data Consortium (LDC). In this paper, we provide
a first description of the meeting recorder dialog act
(MRDA) corpus, a companion set of annotations that augment
the word transcriptions with discourse-level segmentations,
dialog act (DA) information, and adjacency pair information.
The corpus is currently available online for research
purposes [16], and we plan a future release through the LDC.


2  Data 

The ICSI Meeting Corpus data is described in detail in [9].
It consists of 75 meetings, each roughly an hour in length.
There are 53 unique speakers in the corpus, and an average
of about 6 speakers per meeting. Reflecting the makeup of
the Institute, there are more male than female speakers (40
and 13, respectively). There are 28 native English speakers,
although many of the nonnative English speakers are quite
fluent. Of the 75 meetings, 29 are meetings of the ICSI
meeting recorder project itself, 23 are meetings of a
research group focused on robustness in automatic speech
recognition, 15 involve a group discussing natural language
processing and neural theories of language, and 8 are
miscellaneous meeting types. The last set includes 2 very
interesting meetings involving the corpus transcribers as
participants (example included in [16]).

3  Annotation 

Annotation involved three types of information: marking of
DA segment boundaries, marking of DAs themselves, and
marking of correspondences between DAs (adjacency pairs,
[12]). Each type of annotation is described in detail in
[7]. Segmentation methods were developed based on separating
out speech regions having different discourse functions, but
also paying attention to pauses and intonational grouping.
To distinguish utterances that are prosodically one unit but
which contain multiple DAs, we use a pipe bar (|) in the
annotations. This allows the researcher to either split or
not split at the bar, depending on the research goals.

Figure 1: Mapping of MRDA tags to SWBD-DAMSL tags. Tags in
boldface are not present in SWBD-DAMSL and were added in
MRDA. Tags in italics are based on the SWBD-DAMSL version
but have had meanings modified for MRDA. The ordering of
tags in the table is explained as follows: In the mapping of
DAMSL tags to SWBD-DAMSL tags in the SWBD-DAMSL manual, tags
were ordered in categories such as “Communication Status”,
“Information Requests”, and so on. In the mapping of MRDA
tags to SWBD-DAMSL tags here, we have retained the same
overall ordering of tags within the table, but we do not
explicitly mark the higher-level SWBD-DAMSL categories in
order to avoid confusion, since categorical structure
differs in the two systems (see [7]).

We examined existing annotation systems, including
[1,2,5,6,8,10,11], for similarity to the style of
interaction in the ICSI meetings. We found that SWBD-DAMSL
[11], a system adapted from DAMSL [6], provided a fairly
good fit. Although our meetings were natural, and thus had
real agenda items, the dialog was less like human-human or
human-machine task-oriented dialog (e.g., [1,2,10]) and more
like human-human casual conversation ([5,6,8,11]). Since we
were working with English rather than Spanish, and did not
view a large tag set as a problem, we preferred [6,11] over
[5,8] for this work. We modified the system in [11] in a
number of ways, as indicated in Figure 1 and as explained
further in [7]. The MRDA system requires one “general tag”
per DA, and attaches a variable number of following
“specific tags”. Excluding nonlabelable cases, there are 11
general tags and 39 specific tags. There are two disruption
forms (%-, %--), two types of indecipherable utterances
(x, %), and a non-DA tag to denote rising tone (rt).
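
The label structure described above can be illustrated with a small Python sketch. It is not part of the corpus distribution. The function names, the example labels, and the use of “^” as the separator between tags within one DA are assumptions made for illustration, and disruption forms are not handled.

```python
from typing import List, NamedTuple


class DialogAct(NamedTuple):
    general: str         # the one obligatory general tag
    specific: List[str]  # zero or more following specific tags


def da_units(label: str, split_at_pipe: bool = True) -> List[str]:
    """Return the DA label strings contained in one annotated unit.

    If split_at_pipe is False, the whole label is kept as a single unit,
    mirroring the choice to split or not split at the pipe bar.
    """
    if not split_at_pipe:
        return [label.strip()]
    return [part.strip() for part in label.split("|") if part.strip()]


def parse_da(da_label: str, tag_sep: str = "^") -> DialogAct:
    """Split one DA label into its general tag and its specific tags.

    tag_sep is an assumed separator, not confirmed by the text above.
    """
    tags = [t for t in da_label.split(tag_sep) if t]
    if not tags:
        raise ValueError("empty DA label")
    return DialogAct(general=tags[0], specific=tags[1:])


# Hypothetical label: a floor grabber followed by a statement with agreement.
units = da_units("fg|s^aa")
acts = [parse_da(u) for u in units]
```

Keeping the pipe-bar split separate from tag parsing lets the same code serve both research settings: prosodic units (no split) or individual DAs (split).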

An interface allowed annotators to play regions of speech,
modify transcripts, and enter DA and adjacency pair
information, as well as other comments. Meetings were
divided into 10-minute chunks; labeling time averaged about
3 hours per chunk, although this varied considerably
depending on the complexity of the dialog.

4  Annotated  Example 

An example from one of the meetings is shown in Figure 2 as
an illustration of some of the types of interactions we
observe in the corpus. Audio files and additional sample
excerpts are available from [16]. In addition to the obvious
high degree of overlap (roughly one third of all words are
overlapped), note the explicit struggle for the floor
indicated by the two failed floor grabbers (fg) by speakers
c5 and c6. Furthermore, 6 of the 19 total utterances express
some form of agreement or disagreement (arp, aa, and nd)
with previous utterances. Also, of the 19 utterances within
the excerpt, 9 are incomplete due to interruption by another
talker, as is typical of many regions in the corpus showing
high speaker overlap. We find in related work that regions
of high overlap correlate with high speaker involvement, or
“hot spots” [15]. The example also provides a taste of the
frequency and complexity of adjacency pair information. For
example, within only half a minute, speaker c5 has
interacted with speakers c3 and c6, and speaker c6 has
interacted with speakers c2 and c5.

Figure 2: Example from meeting Bmr023. Time marks are
truncated here; actual resolution is 10 msec. “Chan”:
channel (speaker); “DA”: full dialog act label (multiple
tags are separated by “^”); “==”: incomplete DA; “xx -- xx”:
disfluency interruption point between words; “xx-”:
incomplete word; “AP”: adjacency pairs (use arbitrary
identifiers). For purposes of illustration, overlapped
speech regions are indicated in the figure by reverse font
color. Audio and other samples are available from [16].

5  Reliability

We computed interlabeler reliability among the three
labelers for both segmentation (into DA units) and DA
labeling, using randomly selected excerpts from the 75
labeled meetings. Since agreement on DA segmentation does
not appear to have standard associated metrics in the
literature, we developed our own approach. The philosophy is
that any difference in words at the beginning and/or end of
a DA could result in a different label for that DA, and the
more words that are mismatched, the more likely the
difference in label. As a very strict measure of
reliability, we used the following approach: (1) Take one
labeler’s transcript as a reference. (2) Look at each other
labeler’s words. For each word, look at the utterance it
comes from and see if the reference has the exact same
utterance. (3) If it does, there is a match. Match every
word in the utterance, and then mark the matched utterance
in the reference so it cannot be matched again (this
prevents accidental matches due to identical repeated
words). (4) Repeat this process for each word in each
reference-labeler pair, and rotate to the next labeler as
the reference. Note that this metric requires perfect
matching of the full utterance a word is in for that word to
be matched. For example, in the following case, labelers
agree on 3 segmentation locations, but the agreement on our
metric is only 0.14, since only 1 of 7 words is matched:

. yeah . I agree it's a hard decision .

. yeah . I agree . it's a hard decision .

Overall  segmentation  results  on  this  metric  are  provided 
by  labeler  pair  in  Table  1. 
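
The strict metric can be sketched in Python for a single reference-labeler pair. This is a minimal sketch, not the scoring code used for Table 1. It assumes each labeler's segmentation is given as a list of utterances (tuples of words) and that two utterances are "the exact same" when their word sequences are identical. A real implementation would presumably also take position or time marks into account.

```python
from collections import Counter
from typing import Sequence, Tuple

Utterance = Tuple[str, ...]


def strict_agreement(reference: Sequence[Utterance],
                     other: Sequence[Utterance]) -> float:
    """Fraction of the other labeler's words whose full utterance
    also appears, not yet matched, in the reference segmentation."""
    available = Counter(reference)  # reference utterances still unmatched
    matched = 0
    total = 0
    for utt in other:
        total += len(utt)
        if available[utt] > 0:
            available[utt] -= 1     # mark as used so it cannot match again
            matched += len(utt)     # every word in the utterance matches
    return matched / total if total else 0.0


# The two segmentations from the example above.
ref = [("yeah",), ("I", "agree", "it's", "a", "hard", "decision")]
oth = [("yeah",), ("I", "agree"), ("it's", "a", "hard", "decision")]
print(round(strict_agreement(ref, oth), 2))  # prints 0.14 (1 of 7 words)
```

Rotating the reference, as in step (4), amounts to calling the function once for each ordered pair of labelers.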

We  examined  agreement  on  DA  labels  using  the  Kappa 
statistic  [3],  which  adjusts  for  chance  agreement. 
Because  of  the  large  number  of  unique  full  label 
combinations,  we  report  Kappa  values  in  Table  2  using 
various  class  mappings  distributed  with  the  corpus. 
Values  are  shown  by  labeler  pair. 
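
For readers unfamiliar with the statistic, the following Python sketch computes Kappa for one labeler pair. It assumes the common two-rater form (observed agreement corrected by the agreement expected from each labeler's class frequencies), which may differ in detail from the formulation in [3]. It also assumes that the two label sequences are already aligned item by item and mapped to classes; the class names in the example are hypothetical.

```python
from collections import Counter
from typing import Sequence


def kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two aligned label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: proportion of items given the same class.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement from the two labelers' class frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both labelers used a single identical class
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical class sequences from two labelers.
a = ["statement", "statement", "backchannel", "question"]
b = ["statement", "backchannel", "backchannel", "question"]
k = kappa(a, b)  # observed 0.75, expected 0.3125
```

Coarser class mappings generally produce fewer distinct classes to confuse, which is one reason to report Kappa separately for each mapping.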


Table 1: Results for strict segmentation agreement metric.

Table 2: Kappa values for DAs using different class
mappings. Map 1: Disruptions vs. backchannels vs. fillers
vs. statements vs. questions vs. unlabelable; does not break
at the “|”. Map 2: Same as Map 1 but breaks at the “|”.
Map 3: Same as Map 2 but breaks down fillers and questions
into further subclasses. See [16] for further details.

The overall value of Kappa for our basic, six-way classmap
(Map 1) is 0.80, representing good agreement for this type
of task.


6  Distributional  Statistics 

We  provide  basic  statistics  based  on  the  dialog  act 
labels  for  the  75  meetings.  If  we  ignore  the  tag  marking 
rising  intonation  (rt),  since  this  is  not  a  DA  tag,  we  find 
180,218  total  tags.  Table  3  shows  the  distribution  of  the 
tags  in  more  detail. 


Table 3: Distribution of tags. Tags are listed in order of
descending frequency; values are percentages of the 180,218
total tags.

If instead we look at only the 11 obligatory general tags,
for which there is one per DA, and if we split labels at the
pipe bar, the total is 113,560 (excluding tags that only
include a disruption label). The distribution of general
tags is shown in Table 4.
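
The counting procedure for general tags can be sketched as follows. This is an illustrative Python sketch, not the script used to produce Table 4. The “^” tag separator, the spelling of the disruption forms, and the example labels are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable

DISRUPTIONS = {"%-", "%--"}  # assumed spelling of the two disruption forms


def general_tag_percentages(labels: Iterable[str],
                            tag_sep: str = "^") -> Dict[str, float]:
    """Percentage of each general tag, splitting labels at the pipe bar
    and skipping DAs that consist only of a disruption label."""
    counts: Counter = Counter()
    for label in labels:
        for da in label.split("|"):
            tags = [t for t in da.strip().split(tag_sep) if t]
            if not tags or all(t in DISRUPTIONS for t in tags):
                continue  # empty unit or disruption-only label
            counts[tags[0]] += 1  # the general tag comes first
    total = sum(counts.values())
    # most_common() yields tags in order of descending frequency.
    return {tag: 100.0 * n / total for tag, n in counts.most_common()}


# Hypothetical labels; the third is disruption-only and is excluded.
dist = general_tag_percentages(["s^aa", "fg|s", "%-", "b"])
```

Here `dist` maps "s" to 50.0 and "fg" and "b" to 25.0 each, since four general tags remain after splitting and exclusion.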

Table 4: Distribution of general tags; values are
percentages of 113,560 total general tags.

7  Auxiliary  Information

We include other useful information with the corpus.
Word-level time information is available, based on
alignments from an automatic speech recognizer. Annotator
comments are also provided. We suggest various ways to group
the large set of labels into a smaller set of classes,
depending on the research focus. Finally, the corpus
contains information that may be useful for developing
automatic modeling of prosody, such as hand-marked
annotation of rising intonation.